Overall Analysis

Insurance Dataset

A glimpse

Rows: 1,338
Columns: 7
$ age      <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex      <chr> "female", "male", "male", "male", "male", "female", "female",…
$ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
$ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…

---
title: "Assignment 7"
output: 
  flexdashboard::flex_dashboard:
    theme: 
      version: 4
      bootswatch: default
      navbar-bg: "hotpink"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
```

Overall Analysis
===

Column {data-width=1000}
---

```{r}

```

Insurance Dataset
===

Column {data-width=1000}
---

### A glimpse

```{r}
insurance <- read_csv("~/Desktop/MTH209/Labs/insurance.csv")
glimpse(insurance)
```

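There are 1,338 observations of 7 variables: four numeric (`age`, `bmi`, `children`, `charges`) and three character (`sex`, `smoker`, `region`).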

By Region
===

Column {data-width=300}
---

### Insurance Claims by Region

```{r}
insurance %>% ggplot(aes(x=region)) + geom_bar(fill = "magenta") + labs(title="Number of Health Insurance Claims by Region", x="Region", y="Count")
```

### Notes

Insurance claims appear to be distributed relatively evenly across the four regions; however, the Southeast region has slightly more claims than each of the other three.

Column {data-width=300}
---

### Smoking by Region

```{r}
smoker_proportions <- insurance %>%
  group_by(region, smoker) %>%
  summarise(count = n(), .groups = "drop_last") %>%  # stay grouped by region
  mutate(proportion = count / sum(count))            # share within each region

ggplot(smoker_proportions, aes(x = region, y = proportion, fill = smoker)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Proportion of Smokers by Region",
       x = "Region",
       y = "Percentage",
       fill = "Smoker")
```

### Notes

Interestingly, the Southeast region also has a slightly higher proportion of smokers than the other three regions.
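
The within-region proportions behind this chart can be cross-checked with base R. The sketch below uses a small made-up data frame (`toy`, an assumption for illustration, not the real insurance data) and `prop.table()` with `margin = 1`, so that the shares in each region sum to 1.

```r
# Hypothetical toy data standing in for the insurance data frame
toy <- data.frame(
  region = c("southeast", "southeast", "southeast", "southeast",
             "northwest", "northwest", "northwest", "northwest"),
  smoker = c("yes", "yes", "no", "no",
             "yes", "no", "no", "no")
)

# Cross-tabulate, then convert counts to row-wise (within-region) proportions
counts <- table(toy$region, toy$smoker)
smoker_share <- prop.table(counts, margin = 1)

print(smoker_share)
```

On the real data, the same check would presumably be `prop.table(table(insurance$region, insurance$smoker), margin = 1)`.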

Column {data-width=300}
---

### BMI by Region

```{r}
insurance %>% ggplot(aes(x=region, y=bmi)) + geom_boxplot(color = "magenta", fill = "limegreen") + labs(title="BMI based on Region", x="Region", y="BMI")
```

### Notes

BMI also appears to be highest in the Southeast region, judging by the medians in the box plot.

BMI Distribution
===

Column {data-width=1}
---

### Distribution of BMI

```{r}
insurance %>%
  ggplot(aes(x = bmi)) +
  # 30 bins is the ggplot2 default, made explicit
  geom_histogram(bins = 30, color = "hotpink", fill = "magenta") +
  labs(title = "BMI Distribution", x = "BMI", y = "Count")
```

Column {data-width=1}
---

### Notes

BMI appears to be fairly normally distributed; however, there is a very slight right skew.
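
A "very slight right skew" can be put on a numeric footing with the sample skewness (the third standardized moment). The helper `sample_skewness()` and the two toy vectors below are assumptions made for illustration; they are not part of the dashboard or the insurance data.

```r
# Sample skewness: the third standardized moment (positive = right skew)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical values, not taken from the insurance data
symmetric_values <- c(20, 25, 30, 35, 40)
right_skewed_values <- c(20, 22, 24, 26, 60)

sample_skewness(symmetric_values)     # zero for perfectly symmetric data
sample_skewness(right_skewed_values)  # positive for a long right tail
```

Applied to the dashboard's data, this would be `sample_skewness(insurance$bmi)`; a small positive value would be consistent with the reading above.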

Insurance Charges
===

Column {data-width=1}
---

### Distribution of Charges

```{r}
insurance %>%
  ggplot(aes(x = charges)) +
  # 30 bins is the ggplot2 default, made explicit
  geom_histogram(bins = 30, color = "hotpink", fill = "magenta") +
  labs(title = "Insurance Charge Distribution", x = "Charges", y = "Count")
```

### Notes

Charges are not normally distributed; the distribution has a heavy right skew.
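
One quick numeric sign of right skew is a mean that sits well above the median, because a long right tail pulls the mean upward. The sketch below shows this on made-up values (`toy_charges` is an assumption for illustration, not real data).

```r
# Hypothetical charges with one very large value (not real data)
toy_charges <- c(1500, 2500, 4000, 6000, 9000, 48000)

mean_charge <- mean(toy_charges)      # pulled upward by the large value
median_charge <- median(toy_charges)  # resistant to the large value

mean_charge > median_charge
```

The same comparison on the dashboard's data would use `mean(insurance$charges)` and `median(insurance$charges)`.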

Column {data-width=1}
---

### Insurance Charges by Age

```{r}
insurance %>% ggplot(aes(x=age, y=charges)) + geom_point(color="magenta") + labs(title="Age vs. Insurance Claim Charge", x="Age", y="Charge")
```

### Notes

Most points on the scatter plot are bunched towards the bottom of the plot. However, the claim amount still increases steadily with age.

Column {data-width=1}
---

### Insurance Charges by Age and Smoker Status

```{r}
insurance %>% ggplot(aes(x=age, y=charges, color=smoker)) + geom_point() + labs(title="Age vs. Insurance Claim Amount, Depending on Smoker Status", x="Age", y="Claim Amount")
```

### Notes

There is a clear difference between the claim amounts of smokers and non-smokers. Smokers generally make larger claims, and the claim amount increases as age increases. Non-smokers generally make smaller claims, but the claim amount still increases with age.
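
The pattern described here (a higher level for smokers, with a rise with age in both groups) can be checked by fitting one straight line per group. The sketch below does this on noise-free toy data; the data frame `toy`, the 20,000 gap, and the slope of 250 per year are all made-up assumptions, not estimates from the insurance data.

```r
# Hypothetical toy data: smokers sit 20,000 above non-smokers,
# and both groups rise by 250 per year of age (noise-free by construction)
toy <- data.frame(
  age = rep(c(20, 30, 40, 50, 60), times = 2),
  smoker = rep(c("no", "yes"), each = 5)
)
toy$charges <- 1000 + 250 * toy$age + ifelse(toy$smoker == "yes", 20000, 0)

# Fit one straight line per smoker group and keep the coefficients
fits <- lapply(split(toy, toy$smoker), function(group) {
  coef(lm(charges ~ age, data = group))
})

fits$no   # intercept and age slope for non-smokers
fits$yes  # intercept and age slope for smokers
```

Replacing `toy` with `insurance` would give the per-group intercepts and slopes for the real data.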

Validity of Trendlines
===

Column {data-width=1}
---

### Trendline 1

```{r}
smoker <- insurance %>% filter(smoker == "yes")
nonsmoker <- insurance %>% filter(smoker == "no")

smoker %>% ggplot(aes(x=age, y=charges)) + geom_point(color="red") + geom_smooth() + labs(title="Insurance Charges for Smokers", x="Age", y="Charges")
```

### Notes

In this scenario, the trendline is not a valid representation of the data. There are clearly two groups in this plot, and the trendline does not accurately represent either of them.
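
Why a single trendline fails for two groups can be seen in a small sketch. It uses a straight-line `lm()` fit as a simple stand-in for the smoother drawn by `geom_smooth()`, and two made-up parallel bands (`band`, `age`, and `charges` here are assumptions for illustration, not the insurance data). The pooled line lands between the bands, so every point in one band lies below it and every point in the other lies above it.

```r
# Hypothetical toy data: two parallel bands, 20,000 apart (not real data)
age <- rep(c(20, 30, 40, 50, 60), times = 2)
band <- rep(c("lower", "upper"), each = 5)
charges <- 15000 + 250 * age + ifelse(band == "upper", 20000, 0)

# One line fitted to both bands at once
pooled_fit <- lm(charges ~ age)

# Average residual per band: negative = below the line, positive = above it
residuals_by_band <- tapply(resid(pooled_fit), band, mean)

residuals_by_band
```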

Column {data-width=1}
---

### Trendline 2

```{r}
nonsmoker %>% ggplot(aes(x=age, y=charges)) + geom_point(color="red") + geom_smooth() + labs(title="Insurance Charges for Nonsmokers", x="Age", y="Charges")
```

### Notes

In this scenario, the trendline represents the data fairly well. There are a few outliers, but the trendline is reasonably accurate for most members of the group.

To separate the data further, splitting by sex (the `sex` variable in the dataset) could be informative. Splitting by disease status (cancer vs. non-cancer) might also help, but that variable is not among the seven in this dataset and would require additional data.

Number of Children and Insurance Charges
===

Column {data-width=1}
---

### Distribution of Number of Children

```{r}
children_counts <- insurance %>%
  count(children)


ggplot(children_counts, aes(x = "", y = n, fill = as.factor(children))) +
  geom_bar(stat = "identity") +
  coord_polar("y", start = 0) +
  labs(title = "Distribution of Number of Children",
       fill = "Number of Children",
       x = NULL, y = NULL) +
  theme_void() +
  theme(legend.position = "right")
```

### Notes

A plurality of insurance claims comes from people with zero children, and the number of claims generally falls as the number of children increases.
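
A pie chart makes exact shares hard to read, so a table of proportions is a useful companion. The sketch below uses a made-up vector (`toy_children`, an assumption for illustration) in which the most common value holds the largest share without reaching half, which is what "plurality" means.

```r
# Hypothetical toy vector of children counts (not the real data)
toy_children <- c(0, 0, 0, 0, 1, 1, 1, 2, 2, 3)

child_counts <- table(toy_children)
child_share <- prop.table(child_counts)

child_share                                  # share per number of children
names(child_share)[which.max(child_share)]   # most common value
```

On the dashboard's data, the equivalent call would be `prop.table(table(insurance$children))`.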

Column {data-width=1}
---

### Distribution of Charges by Number of Children

```{r}
insurance %>% ggplot(aes(x=as.factor(children), y=charges)) + geom_boxplot(fill="skyblue", color="blue") + labs(title="Distribution of Charges Based on Number of Children", x="Number of Children", y="Charges")
```

### Notes

People with four or five children make the fewest very expensive insurance claims (their box plots show the fewest outliers), although these are also the smallest groups, so fewer outliers are to be expected. There are many outliers among people with zero children. People with one, two, or three children make broadly similar insurance claims.
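
Counting outliers per group puts numbers behind the box plot. The sketch below uses base R's `boxplot.stats()`, which flags points more than 1.5 times the hinge spread beyond the hinges. `geom_boxplot()` follows the same 1.5 × IQR convention but computes the quartiles slightly differently, so counts on real data may not match the plot exactly. The data frame `toy` is made up for illustration.

```r
# Hypothetical toy data: the zero-children group has one extreme charge
toy <- data.frame(
  children = c(0, 0, 0, 0, 0, 0, 4, 4, 4, 4),
  charges  = c(2000, 3000, 4000, 5000, 6000, 60000, 3000, 4000, 5000, 6000)
)

# Number of flagged outliers in each children group
outlier_counts <- tapply(toy$charges, toy$children, function(x) {
  length(boxplot.stats(x)$out)
})

outlier_counts
```

Dividing these counts by the group sizes (`table(toy$children)`) would give outlier rates, which are easier to compare across groups of very different sizes.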